5 research outputs found

    Ethical scaling for content moderation: Extreme speech and the (in)significance of artificial intelligence

    In this article, we present new empirical evidence to demonstrate the severe limitations of existing machine learning content moderation methods in keeping pace with, let alone staying ahead of, hateful language online. Building on the collaborative coding project “AI4Dignity”, we outline the ambiguities and complexities of annotating problematic text in AI-assisted moderation systems. We diagnose the shortcomings of the content moderation and natural language processing approach as emerging from a broader epistemological trapping wrapped in the liberal-modern idea of “the human”. Presenting a decolonial critique of the “human vs machine” conundrum and drawing attention to the structuring effects of coloniality on extreme speech, we propose “ethical scaling” to highlight the moderation process as political praxis. As a normative framework for platform governance, ethical scaling calls for a transparent, reflexive, and replicable process of iteration for content moderation with community participation and global parity, which should evolve in conjunction with addressing algorithmic amplification of divisive content and resource allocation for content moderation.

    Ob-Ugric database : Corpus and lexicon databases of Khanty and Mansi dialects

    In this paper we describe the data processing procedures and the preliminary results of the project Ob-Ugric database (OUDB), a web-based framework which aims at developing corpus-based descriptive resources of Khanty and Mansi dialects. Using established language documentation and annotation tools, OUDB provides interlinked corpus and lexicon data from digitized texts as well as recent fieldwork studies in a uniform IPA transcription, together with the corresponding audio recordings. It thus makes these less-described languages of the Ob-Ugric branch of the Finno-Ugric language family accessible to researchers as well as the language community, and archives the raw data for documentation, linguistic evaluation and possible future use in building resources for language technology applications.

    Reference tracking mechanisms and automatic annotation based on Ob-Ugric information structure

    The following paper is concerned with information structure in the Ob-Ugric languages and its manifestation in reference tracking and its mechanisms. We will show how knowledge of both information structure and reference tracking mechanisms can be used to develop a system for the (semi-)automatic annotation of syntactic, semantic and pragmatic functions. We assume that the principles of information structure, i.e., the balancing of the content of an utterance, are indicated by the use of anaphoric devices to mark participants in an ongoing discourse. This process, in which participants are encoded by the speaker and decoded by the hearer, is called reference tracking. Our model distinguishes four important factors that play a role in reference tracking: inherent (linguistic) features of a referent, information structure, referential devices and referential strategies. We call the interaction between these factors reference tracking mechanisms. Here, the passive voice and the dative shift are used to exemplify this complex interaction system. Drawing conclusions from this, we develop rules to annotate the syntactic, semantic and pragmatic roles of discourse participants (semi-)automatically.

    Listening to Affected Communities to Define Extreme Speech: Dataset and Experiments

    Building on current work on multilingual hate speech (e.g., Ousidhoum et al. (2019)) and hate speech reduction (e.g., Sap et al. (2020)), we present XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages from Brazil, Germany, India and Kenya. The key novelty is that we directly involve the affected communities in collecting and annotating the data, as opposed to giving companies and governments control over defining and combatting hate speech. This inclusive approach results in datasets more representative of actually occurring online speech and is likely to facilitate the removal of the social media content that marginalized communities view as causing the most harm. Based on XTREMESPEECH, we establish novel tasks with accompanying baselines, provide evidence that cross-country training is generally not feasible due to cultural differences between countries, and perform an interpretability analysis of BERT’s predictions.